Main goals of the session
To put into practice the concepts of species divergence and evolutionary rates that you will learn in the theory lessons, you will analyse nucleotide divergence in two genes (obp83b and rpL32) with different biological functions in five closely related species of the genus Drosophila.
Throughout the document you will see different icons whose meaning is:
: Additional or useful information
: Practical exercise
: Hint to solve an exercise or to do a task
: Slot to answer a question
: Problem or task to be solved
You can use either the R console in the terminal or
RStudio for this practice. If you don’t have R
installed, you can download the appropriate package for your system here. To install
RStudio, go to this page and
follow the instructions.
Before starting the exercises, you will need to install some
necessary libraries for downloading sequences from the GenBank
database, performing multiple sequence alignments, building phylogenetic
trees, and calibrating these trees. Open the R console in
the terminal (or in RStudio) and type:
packages <- c("reutils", "ape", "seqinr", "phangorn", "phylotools")
#install.packages(setdiff(packages, rownames(installed.packages())))
#install.packages(packages)
You can use the efetch() function implemented in the
reutils package to retrieve sequences from databases. These
functions allow you to connect to GenBank and download the
sequences directly to your computer in FASTA format. You will need the
GenBank identifier of either the gene region or the complete
genome sequence of the species and the coordinates of the desired gene
region in that genome. Note that for some species the gene is
encoded in the reverse complement sequence relative to the reference in
the database. In these cases you will need to specify the strand in
efetch. In the example below
Let’s see how to retrieve the coding regions (CDS) of the obp83b gene in different species. Table 1 lists the identifiers and coordinates of this gene region in the five species:
Table 1. Identifiers and genomic coordinates of the obp83b gene in different Drosophila species.
| Species | ID | Coordinates in the genome | Strand |
|---|---|---|---|
| D. melanogaster | NT_033777 | 5976177-5976817 | 2 |
| D. simulans | NC_052523 | 2470507-2471257 | 2 |
| D. erecta | NW_020825194 | 26500029-26500718 | 1 |
| D. elegans | NW_024545863 | 12746083-12746802 | 2 |
| D. pseudopobscura | NC_046679 | 16336491-16337172 | 1 |
GenBank_ identifiers can be obtained from the literature (articles) by starting a search with R libraries such as
rentrez, directly from the web page of this database, usingesearch()fromreutils, or using other programmes and more advanced bioinformatics tools. .
To download and write a file in FASTA format with the CDS sequence of
the obp83b gene in these species, you can use the
efetch() function of reutils.
library(reutils)
library(ape)
## create a new working directory
dir.create("divergence")
Warning in dir.create("divergence"): 'divergence' already exists
## example for D. pseudopobscura:
efetch("NC_046679", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=1, seqstart=16336491, seqstop=16337172, outfile="./divergence/Dpse_obp83b_cds.fasta")
#Get D.melanogaster
efetch("NT_033777", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart=5976177, seqstop=5976817, outfile="./divergence/Dmel_obp83b_cds.fasta")
#Get D.simulans
efetch("NC_052523", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart=2470507, seqstop=2471257, outfile="./divergence/Dsim_obp83b_cds.fasta")
#Get D.erecta
efetch("NW_020825194", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=1, seqstart=26500029, seqstop=26500718, outfile="./divergence/Dere_obp83b_cds.fasta")
#Get D.elegans
efetch("NW_024545863", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart=12746083, seqstop=12746802, outfile="./divergence/Dele_obp83b_cds.fasta")
#Command line execution (otherwise use read.dna and write.dna)
#cat D*_obp83b_cds.fasta > obp83b_cds.fasta
EntryName<-function(folder, name){
folder<-paste0("./", folder, "/")
f<-paste0(folder, name,"_cds.fasta")
sequence<-read.dna(f, format="fasta")
seqnc<-updateLabel(sequence, labels(sequence), name)
write.dna(seqnc, format="fasta", file=paste0(folder, "obp83b_cds.fasta"), append=TRUE)
}
new_names<-c("Dmel_obp83b",
"Dsim_obp83b",
"Dere_obp83b",
"Dele_obp83b",
"Dpse_obp83b")
for (name in new_names){
EntryName("divergence", name)
}
db, specifies the database;rettype, specifies the subsequence to be extracted (=CDS) and the format (=FASTA);retmode, specifies the file format (=plain text);strand, 1: positive strand, 2: negative (reverse complement with respect to the reference in the database);seqstartandseqstop, are the coordinates of the sequence to be downloaded (required if we are extracting gene regions from the complete genome sequence set).
read.dna() and
write.dna() functions in the ape package.To identify the nucleotide substitutions that have occurred in the
coding region of the opb83b gene during the divergence of the
four species, one must first predict which positions are homologous
(i.e. derived from a common ancestor). There are many algorithms and
programs for constructing multiple sequence alignments of DNA sequences.
For purely practical reasons, in this session you will use
ape, an R package that implements some of
these algorithms.
library(ape)
## read the sequences
myseqs<-read.dna("divergence/obp83b_cds.fasta", format="fasta")
## visualize the sequences
myseqs
5 DNA sequences in binary format stored in a list.
Mean sequence length: 427.8
Shortest sequence: 426
Longest sequence: 429
Labels:
Dmel_obp83b
Dsim_obp83b
Dere_obp83b
Dele_obp83b
Dpse_obp83b
Base composition:
a c g t
0.252 0.253 0.271 0.224
(Total: 2.14 kb)
## multiple sequence alignment
myalign <- clustal(myseqs)
## partial visualization of the generated alignment
myalign
5 DNA sequences in binary format stored in a matrix.
All sequences of same length: 429
Labels:
Dmel_obp83b
Dsim_obp83b
Dere_obp83b
Dele_obp83b
Dpse_obp83b
Base composition:
a c g t
0.252 0.253 0.271 0.224
(Total: 2.14 kb)
## write a file (maintaining the gaps introduced in the msa step)
write.dna(myalign, file="divergence/obp83b_cds_aln.fasta", format="fasta")
Coding regions are special for molecular evolutionary analyses because they contain two different types of positions in terms of the functional consequences of mutations, namely synonymous and nonsynonymous positions. Synonymous changes are those that do not alter the amino acid encoded by the codon, whereas nonsynonymous changes cause an amino acid substitution in the protein. Consequently, synonymous sites are sites where if a change occurs, it will be a synonymous change. The same applies to nonsynonymous sites. Studying and comparing the variability of these two types of sites is very informative about selective constraints and the functional consequences of mutations.
Divergence is estimated as the number of substitutions per site (this allows the divergence of regions of different lengths to be compared). Therefore, to estimate synonymous and nonsynonymous divergence, we first need to calculate the number of sites of each type in the coding regions of interest. The most commonly used method to estimate these sites is the Nei-Gojobori (N-G) method (Figure 1). The N-G approach consists of estimating the proportion of changes that would be synonymous and nonsynonymous in a given codon after considering all nine possible nucleotide changes (based on the genetic code).
Figure 1. Nei-Gojobori method for estimating the number of synonymous and nonsynonymous sites in a codon
Here are some examples of functions for working with changes and synonymous sites:
library(seqinr)
Attaching package: 'seqinr'
The following objects are masked from 'package:ape':
as.alignment, consensus
translate<-seqinr::translate
## function to translate a codon:
dna_to_aa <- function(codon) {
dna<-s2c(codon)
aa<-translate(dna, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)
return (aa)
}
## function to determine if a change between two codons is synonym
is_synonymous<-function(codon1, codon2) {
return (dna_to_aa(codon1) == dna_to_aa(codon2))
}
## function to estimate the number of synonymous sites in a codon:
synpos<-function(codon) {
pos<-s2c(codon)
for (i in 1:length(pos)) {
base = pos[i]
BASES = c('A', 'G', 'T', 'C')
other_bases = BASES[BASES!=base]
syn=0
for (new_base in other_bases) {
new_pos=c(pos[0:(i-1)], new_base, pos[-(1:i)])
new_codon=paste(new_pos, collapse="")
s<-(is_synonymous(codon, new_codon))
syn <- syn + length(s[s==TRUE])
}
synp <- syn/3
}
return (synp)
}
Calculate the number of synonymous and nonsynonymous changes and sites in this coding sequence alignment using the functions above:
Seq 1 ATGATGCAGAGTCTGTAA Seq 2 ATGAGGCACAGTCTGTAA
The number of sites
of each class in each sequence is the sum over all codons in the
sequence, and the total number of sites of each class in the alignment
is the average over all sequences. You can apply the functions
separately for each codon, or consider a loop integrating functions such
as s2c() and splitseq() from the
seqinr package
seq1<-"ATGATGCAGAGTCTGTAA"
seq2<-"ATGAGGCACAGTCTGTAA"
aa1<-dna_to_aa(seq1)
aa2<-dna_to_aa(seq2)
syn_list<-is_synonymous(seq1,seq2)
syn<-sum(syn_list)
nonsyn<-length(syn_list)-syn
changeCounter<-function(seq){
i=3
pre=0
seq_Syn=0
while (i<=nchar(seq)){
s<-substr(seq, pre, i)
seq_Syn=seq_Syn+synpos(s)
pre=i+1
i=i+3
}
seq_NSyn=nchar(seq)-seq_Syn
#syn=paste0("Synonymous: ", seq_Syn)
#nsyn=paste0("Non-synonymous: ", seq_NSyn)
return(c(seq_Syn, seq_NSyn))
}
seq1_data<-changeCounter(seq1)
seq2_data<-changeCounter(seq2)
syn_sites<-(seq1_data[1]+seq2_data[1])/2
nsyn_sites<-(seq1_data[2]+seq2_data[2])/2
Seq1:
Synonymous changes: 2
Non-synonymous changes: 16
Seq2:
Synonymous changes: 2.333333
Non-synonymous changes: 15.66667
Synonymous sites: 2.166667
Non-synonymous sites: 15.83333
Fortunately, you will not need to repeat this calculation for all the
gene sequences we will be studying in this lab. There are R
functions that can be used to calculate the synonymous and nonsynonymous
divergence of a coding region directly from the CDS alignment. The
seqinr package implements the KaKs() function,
which allows the calculation of these divergences. Before proceeding
with the calculations, we will change the sequence names in the fasta
file to make it easier to identify the species in the results matrix
(now the identifiers are very long and some of them do not contain the
species name):
library(phylotools)
Attaching package: 'phylotools'
The following object is masked from 'package:seqinr':
read.fasta
old_names<-get.fasta.name("divergence/obp83b_cds_aln.fasta")
## new names must be in the same order than old names...
new_names<-c("Dmel_obp83b",
"Dsim_obp83b",
"Dere_obp83b",
"Dele_obp83b",
"Dpse_obp83b")
ref2 <- data.frame(old_names, new_names)
rename.fasta(infile = "divergence/obp83b_cds_aln.fasta", ref_table = ref2, outfile = "divergence/obp83b_cds_renamed_aln.fasta")
divergence/obp83b_cds_renamed_aln.fasta has been saved to /home/jj/Desktop/Bioinformatics/2nd_year/3term/Population_Genetics_ME/Seminars/P4_Estimating_nucleotide_divergences+evolutionary_rates
cds <- read.alignment("divergence/obp83b_cds_renamed_aln.fasta",format="fasta")
## matrix with synonymous divergences (Ks)
kaks(cds)$ks
Dele_obp83b Dere_obp83b Dmel_obp83b Dpse_obp83b
Dere_obp83b 0.6182696
Dmel_obp83b 0.5494927 0.3234201
Dpse_obp83b 1.0414168 1.4340955 0.9158393
Dsim_obp83b 0.7103626 0.3617048 0.1256942 0.9289957
## matrix with nonsynonymous divergences (Ka)
kaks(cds)$ka
Dele_obp83b Dere_obp83b Dmel_obp83b Dpse_obp83b
Dere_obp83b 0.03576413
Dmel_obp83b 0.03749267 0.03931737
Dpse_obp83b 0.07112614 0.08614269 0.07829961
Dsim_obp83b 0.02529429 0.02555604 0.01992201 0.07511411
In addition to the CDS, noncoding sequences are also found in eukaryotic genes. These sequences correspond to introns, exonic untranslated regions (5’ or 3’) or 5’ proximal regions, often containing promoters and regulatory elements. Here, you will analyse the divergence in the 5’ proximal region and the introns of the obp83b gene in the same five species.
First, you will obtain noncoding sequences of this gene for all
species. To do this, you need to know exon coordinates. Again, you can
use reutils functions to read the GenBank entries in your
R terminal:
## example for D. pseudoobscura:
x<-efetch("NC_046679", db="nucleotide", rettype="gb", retmode="text", strand=1, seqstart=16336491, seqstop=16337172)
con <- content(x, as = "textConnection")
readLines(con)
[1] "LOCUS NC_046679 682 bp DNA linear CON 14-APR-2020"
[2] "DEFINITION Drosophila pseudoobscura strain MV2-25 chromosome 2, UCI_Dpse_MV25,"
[3] " whole genome shotgun sequence."
[4] "ACCESSION NC_046679 REGION: 16336491..16337172"
[5] "VERSION NC_046679.1"
[6] "DBLINK BioProject: PRJNA622252"
[7] " BioSample: SAMN13616452"
[8] " Assembly: GCF_009870125.1"
[9] "KEYWORDS WGS; RefSeq."
[10] "SOURCE Drosophila pseudoobscura"
[11] " ORGANISM Drosophila pseudoobscura"
[12] " Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta;"
[13] " Pterygota; Neoptera; Endopterygota; Diptera; Brachycera;"
[14] " Muscomorpha; Ephydroidea; Drosophilidae; Drosophila; Sophophora."
[15] "COMMENT REFSEQ INFORMATION: The reference sequence is identical to"
[16] " CM020868.1."
[17] " Assembly name: UCI_Dpse_MV25"
[18] " The genomic sequence for this RefSeq record is from the"
[19] " whole-genome assembly released by the University of California,"
[20] " Irvine on 2020/01/15. The original whole-genome shotgun project has"
[21] " the accession WVEN00000000.1."
[22] " "
[23] " ##Genome-Assembly-Data-START##"
[24] " Assembly Provider :: University of California, Irvine"
[25] " Assembly Method :: Canu v. 1.7"
[26] " Assembly Name :: UCI_Dpse_MV25"
[27] " Genome Representation :: Full"
[28] " Expected Final Version :: Yes"
[29] " Genome Coverage :: 280.0x"
[30] " Sequencing Technology :: PacBio Sequel"
[31] " ##Genome-Assembly-Data-END##"
[32] " "
[33] " ##Genome-Annotation-Data-START##"
[34] " Annotation Provider :: NCBI"
[35] " Annotation Status :: Full annotation"
[36] " Annotation Name :: Drosophila pseudoobscura Annotation"
[37] " Release 104"
[38] " Annotation Version :: 104"
[39] " Annotation Pipeline :: NCBI eukaryotic genome annotation"
[40] " pipeline"
[41] " Annotation Software Version :: 8.4"
[42] " Annotation Method :: Best-placed RefSeq; Gnomon"
[43] " Features Annotated :: Gene; mRNA; CDS; ncRNA"
[44] " ##Genome-Annotation-Data-END##"
[45] "FEATURES Location/Qualifiers"
[46] " source 1..682"
[47] " /organism=\"Drosophila pseudoobscura\""
[48] " /mol_type=\"genomic DNA\""
[49] " /strain=\"MV2-25\""
[50] " /db_xref=\"taxon:7237\""
[51] " /chromosome=\"2\""
[52] " /sex=\"female\""
[53] " /tissue_type=\"whole body\""
[54] " /dev_stage=\"adult\""
[55] " gene 1..682"
[56] " /gene=\"LOC4801921\""
[57] " /note=\"Derived by automated computational analysis using"
[58] " gene prediction method: Gnomon.\""
[59] " /db_xref=\"GeneID:4801921\""
[60] " mRNA join(1..128,186..261,317..682)"
[61] " /gene=\"LOC4801921\""
[62] " /product=\"pheromone-binding protein-related protein 6\""
[63] " /note=\"Derived by automated computational analysis using"
[64] " gene prediction method: Gnomon. Supporting evidence"
[65] " includes similarity to: 108 Proteins, and 100% coverage of"
[66] " the annotated genomic feature by RNAseq alignments,"
[67] " including 102 samples with support for all annotated"
[68] " introns\""
[69] " /transcript_id=\"XM_001358902.3\""
[70] " /db_xref=\"GeneID:4801921\""
[71] " CDS join(54..128,186..261,317..594)"
[72] " /gene=\"LOC4801921\""
[73] " /note=\"Derived by automated computational analysis using"
[74] " gene prediction method: Gnomon.\""
[75] " /codon_start=1"
[76] " /product=\"pheromone-binding protein-related protein 6\""
[77] " /protein_id=\"XP_001358939.1\""
[78] " /db_xref=\"GeneID:4801921\""
[79] " /translation=\"MMKSVGLLMLLLGCVAAQGPRRDAEYPPPAILKLAQHFHDICAP"
[80] " KTGVTDAAIKEFSDGEIHEDENLKCYMNCLFHEFEVVDDNGDVHMEKLFNAVPSEKLR"
[81] " NVLMEASKDCIHPEGDTLCHKAWWFHQCWKKADPVHYFLV\""
[82] "ORIGIN "
[83] " 1 ccagtccttt gccctcagtc cgctcagtgt ttttgaaaac ttgttgcctg aaaatgatga"
[84] " 61 aatctgttgg tcttctcatg ctgctacttg gctgtgtagc tgcccaggga cccagacggg"
[85] " 121 atgctgaggt ttgtagtgga ttcatgacac ttgtaccatt tatcagttat tatctctcgc"
[86] " 181 cgcagtatcc accgccagca attctaaaat tggcccaaca cttccacgat atttgtgccc"
[87] " 241 caaaaactgg tgtcactgat ggtaggtcct actctttacc attttttcga ttactcttca"
[88] " 301 atttcggtct ctttagcggc catcaaggag ttcagcgacg gagagataca tgaggacgag"
[89] " 361 aatctcaagt gctacatgaa ctgtctcttc cacgagttcg aagtggtcga tgataatggc"
[90] " 421 gatgtacata tggagaagct attcaatgcc gttccctcgg aaaagctgcg caacgtattg"
[91] " 481 atggaggcct ccaaggattg catccatccg gagggcgaca ccttgtgcca caaggcctgg"
[92] " 541 tggttccatc aatgctggaa gaaagctgat cctgtccact atttcttggt ctaagatgta"
[93] " 601 aatgtctttg caatatttat gtttatgtga atgtgtgttt atattaagct taataaatgg"
[94] " 661 ttaaaggttc atttaaacat ta"
[95] "//"
[96] ""
[97] ""
#Get D.melanogaster
y<-efetch("NT_033777", db="nucleotide", rettype="gb", retmode="text", strand=2, seqstart=5976177, seqstop=5976817)
con <- content(y, as = "textConnection")
#readLines(con)
#Get D.simulans
z<-efetch("NC_052523", db="nucleotide", rettype="gb", retmode="text", strand=2, seqstart=2470507, seqstop=2471257)
con <- content(z, as = "textConnection")
#readLines(con)
#Get D.erecta
a<-efetch("NW_020825194", db="nucleotide", rettype="gb", retmode="text", strand=1, seqstart=26500029, seqstop=26500718)
con <- content(a, as = "textConnection")
#readLines(con)
#Get D.elegans
b<-efetch("NW_024545863", db="nucleotide", rettype="gb", retmode="text", strand=2, seqstart=12746083, seqstop=12746802)
con <- content(b, as = "textConnection")
#readLines(con)
The
readLines()function prints the entry in GenBank format. Locate the feature “CDS” to find the coordinates of the coding region of the gene obp83b in this species.
Once you know the coordinates of CDS, you can extract 5’ and intron sequences:
## example for D.pseudobscura
## download the complete gene region
efetch("NC_046679", db="nucleotide", rettype="fasta", retmode="text", strand=1, seqstart=16336491, seqstop=16337172, outfile="divergence/Dpse_obp83b.fasta")
## create an object with the sequence of the complete gene region
sequence<-read.dna("divergence/Dpse_obp83b.fasta", format="fasta")
## create a new sequence with only intron sequences
## CDS = join(54..128,186..261,317..594)
## 5' region = 1-53
## intron 1 = 129-185
## intron2 = 262-316
seqnc<-sequence
seqnc=seqnc[,c(1:53,129:185,262:316)]
## change species label
seqnc<-updateLabel(seqnc, labels(seqnc), "Dpse_obp83b")
## write a fasta file with noncoding sequences of the opb83b gene of ALL SPECIES (with append=TRUE, if your run this script for all species, all sequences will be added to the same file; remember to change the ids in each case)
write.dna(seqnc, format="fasta", file="divergence/obp83b_cds.fasta", append=TRUE)
Now we can align the non-coding sequences and estimate the number of
substitutions per site (KNC) of the obp83b
gene. For nucleotide divergences (without any distinction of positions,
just total divergence) you can use the function dist.dna()
from the package ape.
## read the noncoding sequences
myseqs<-read.dna("divergence/obp83b_cds.fasta", format="fasta")
## visualize the sequences
myseqs
6 DNA sequences in binary format stored in a list.
Mean sequence length: 384
Shortest sequence: 165
Longest sequence: 429
Labels:
Dmel_obp83b
Dsim_obp83b
Dere_obp83b
Dele_obp83b
Dpse_obp83b
Dpse_obp83b
Base composition:
a c g t
0.247 0.252 0.263 0.237
(Total: 2.3 kb)
## multiple sequence alignment
myalign <- clustal(myseqs)
## partial visualization of the generated alignment
myalign
6 DNA sequences in binary format stored in a matrix.
All sequences of same length: 430
Labels:
Dmel_obp83b
Dsim_obp83b
Dere_obp83b
Dele_obp83b
Dpse_obp83b
Dpse_obp83b
Base composition:
a c g t
0.247 0.252 0.263 0.237
(Total: 2.58 kb)
## write the alignment to file
write.dna(myalign, file="divergence/obp83b_nc_aln.fasta", format="fasta")
## read the noncoding alignment
nc <- read.dna("divergence/obp83b_nc_aln.fasta", format="fasta")
## noncoding divergences in the opb83b gene
dist.dna(nc)
Dmel_obp83b Dsim_obp83b Dere_obp83b Dele_obp83b Dpse_obp83b
Dsim_obp83b 0.06368852
Dere_obp83b 0.12667685 0.09786105
Dele_obp83b 0.11196991 0.12706643 0.13419579
Dpse_obp83b 0.14909196 0.15745580 0.22224915 0.13464234
Dpse_obp83b 0.57679173 0.53319951 0.50684131 0.51517541 0.51007558
Fill empty cells in the following tables with the results obtained in the different divergence analyses and answer the questions:
Table 1. Divergence in the coding region of the gene opb83b(Ka and Ks values are above and below the diagonal, respectively)
| Dmel | Dsim | Dere | Dele | Dpse | |
|---|---|---|---|---|---|
| Dmel | ——— | 0.01992201 | 0.03931737 | 0.03749267 | 0.07829961 |
| Dsim | 0.1256942 | ——— | 0.02555604 | 0.02529429 | 0.07511411 |
| Dere | 0.3234201 | 0.3617048 | ——— | 0.03576413 | 0.08614269 |
| Dele | 0.5494927 | 0.7103626 | 0.6182696 | ——— | 0.07112614 |
| Dpse | 0.9158393 | 0.9289957 | 1.4340955 | 1.0414168 | ——— |
Divergence in the non-coding regions of the gene opb83b (Fill in the table with KNC values above the diagonal)
| Dmel | Dsim | Dere | Dele | Dpse | |
|---|---|---|---|---|---|
| Dmel | ——— | 0.06368852 | 0.12667685 | 0.11196991 | 0.14909196 |
| Dsim | ——— | ——— | 0.09786105 | 0.12706643 | 0.15745580 |
| Dere | ——— | ——— | ——— | 0.13419579 | 0.22224915 |
| Dele | ——— | ——— | ——— | ——— | ——— |
| Dpse | ——— | ——— | ——— | ——— | ——— |
Questions:
1. Is the obp83b gene evolving rapidly? And the Obp83b protein?
The average values for each species are as follows: Dmel: 0.47611, Dsim: 0.66702, Dere: 1.02618, and Dele: 1.041416. These values show a significant increase, suggesting a rapid evolution of genes.
For the protein, the averages for non-synonymous substitutions are: Dmel: 0.043757, Dsim: 0.041988, Dere: 0.060953, and Dele: 0.0711261. When we compare these to the synonymous substitution averages, we observe that the non-synonymous averages are significantly lower. This indicates negative selection, suggesting that the protein is also undergoing evolutionary changes.
2. In which of the analysed positions (synonymous, nonsynonymous or noncoding) are evolutionary distances higher? What is the reason for this, in your opinion?
For synonymous positions. This may be because they don't affect the function of the gene, they have no functional constraints, so more of them can happen without impacting the organism negatively.
3. Could you consider some of the observed changes as selectively neutral? Which ones?
The mutations in non-coding regions can be considered as selectively neutral changes because they don't affect the evolution of genes
4. Which species would share a more recent common ancestor? In which evolutionary distances (synonymous, nonsynonymous or noncoding) should we focus to know that?
We have observe the non synonymous regions for the smallest difference, the smallest difference in divergence is Dsim with Dmel with a difference in divergence of 0.0018 aprox.
Some of the most direct applications of nucleotide divergence
estimates are the reconstruction of phylogenetic relationships and the
inference of evolutionary rates. To reconstruct the phylogenetic
relationships among the five Drosophila species using the
information form the obp83b gene, you can use the functions
from ape and phangorn packages.
As an example, the following code reconstructs the phylogenetic relationships between the five species based on synonymous divergences. It uses the synonymous divergence matrix estimated above and the Neighbor Joining algorithm
library(phangorn)
cds <- read.alignment("divergence/obp83b_cds_renamed_aln.fasta",format="fasta")
## NJ tree
syntree <- NJ(kaks(cds)$ka)
plot(syntree)
write.tree(syntree, file="divergence/obp83b_Ks.tree")
By definition, a Neighbor Joining tree is an unrooted tree. There is no root node that is the common ancestor of all species. With an unrooted tree, we can tell the phylogenetic relationships between species (i.e. which species have a more recent common ancestor compared to the other species), but not the direction of evolution (i.e. which species arose more recently). In addition, branch lengths represent substitutions per site, in this case the number of substitutions per site. Sometimes, as in our example, we know which branch contains the root of the species under study (see Figure 2).
Figure 2. Phylogenetic relationships among the species of the genus Drosophila(modified from Mol Ecol Resour. 2022;22:1559–1581)
To calculate the evolutionary rate of the Obp83b protein, we need an ultrametic tree, i.e. a rooted tree in which the branch lengths represent time (not substitutions) and at least one calibration point (i.e. a dated node).
## read the tree to a phylo class object
tree<-read.tree(file="divergence/obp83b_Ks.tree")
## plot the unroted tree
plot(tree, main="Obp gene tree (number of nonsynonymous substitutions per site)")
## root the tree using Drosophila pseudoobscura as the outgroup
rooted<-root(tree, "Dpse_obp83b")
plot(rooted)
## set calibration
## select the node corresponding to the ancestor of D. melanogater, D. simulans and D. erecta, and specify a range of 6.1-18.9 MYA
mrca_node <- getMRCA(rooted, c("Dmel_obp83b", "Dsim_obp83b", "Dere_obp83b"))
cal <- data.frame(
node = mrca_node,
age.min = 6.1,
age.max = 18.9
)
## fit the model (penalized likelihood)
chr<-chronos(rooted, model="clock", calibration = cal)
Setting initial dates...
Fitting in progress... get a first set of estimates
(Penalised) log-lik = -0.5106124
Optimising rates... dates... -0.5106124
log-Lik = -0.5106124
PHIIC = 9.02
# plot de calibrated (ultrametric) tree
plot(chr, main="calibrated tree (million years)")
axisPhylo()
tiplabels()
nodelabels()
chr$edge
[,1] [,2]
[1,] 6 8
[2,] 8 7
[3,] 7 1
[4,] 7 2
[5,] 8 5
[6,] 6 3
[7,] 6 4
chr$edge.length
[1] 10.390326 5.580626 13.319373 13.319373 18.899999 29.290325 29.290325
The
tiplabels()andnodelabels()functions will tell you the id of each node in the tree (they are indicated in the tree). Withchr$edgeandchr$edge.lengthyou can find the correspondence between edge numbers and edge lengths, and thus the estimated date in million years of each node.
You now have all the parameters you need to calculate the evolutionary rate of the Obp83b protein in Drosophila.
Calculate the evolutionary rate (r) of the Obp83b protein in Drosophila using the divergences and times estimated in this practice. Think carefully about which divergence to use for this calculation and apply the formula.
kaks(cds)$ka
Dele_obp83b Dere_obp83b Dmel_obp83b Dpse_obp83b
Dere_obp83b 0.03576413
Dmel_obp83b 0.03749267 0.03931737
Dpse_obp83b 0.07112614 0.08614269 0.07829961
Dsim_obp83b 0.02529429 0.02555604 0.01992201 0.07511411
divergence_times <- c(dele_dere = 21, dele_dpse = 18, dele_dsim = 22, dele_dmel = 22)
ks_values <- c(dele_dere = 0.03576413, dele_dpse = 0.07112614, dele_dsim = 0.02529429, dele_dmel = 0.03749267)
evolutionary_rates <- ks_values /2*divergence_times
mean_evolutionary_rate <- mean(evolutionary_rates)
mean_evolutionary_rate
[1] 0.4265788
cat("The evolutionary rate is", mean_evolutionary_rate)
The evolutionary rate is 0.4265788
Table 2. Identifiers and genomic coordinates of the rpL32 gene in different Drosophila species.
| Species | ID | Coordinates in the genome | Strand |
|---|---|---|---|
| D. melanogaster | NT_033777 | 30045229-30046161 | 2 |
| D. simulans | NC_052523 | 26165102-26166121 | 2 |
| D. erecta | NW_020825194 | 2228933..2229932 | 1 |
| D. serrata | NW_018367383 | 6093667..6094458 | 2 |
| D. pseudopobscura | NC_046679 | 811678-812625 | 1 |
Using the data in Table 2, calculate the rate of evolution of the protein RpL32 in Drosophila and answer the following questions. Note that for this gene D. elegans is replaced by D. serrata.
dir.create("Final_ass")
Warning in dir.create("Final_ass"): 'Final_ass' already exists
#RUN ONLY THE FIRST TIME
#efetch("NT_033777", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart= 30045229, seqstop=30046161, outfile="./Final_ass/Dmel_cds.fasta")
#efetch("NC_052523", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart=26165102, seqstop=26166121, outfile="./Final_ass/Dsim_cds.fasta")
#efetch("NW_020825194", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=1, seqstart=2228933, seqstop=2229932, outfile="./Final_ass/Dere_cds.fasta")
#efetch("NW_018367383", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=2, seqstart=6093667, seqstop=6094458, outfile="./Final_ass/Dser_cds.fasta")
#efetch("NC_046679", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=1, seqstart=811678, seqstop=812625, outfile="./Final_ass/Dpse_cds.fasta")
#Delete entry repetitions in Dere_obp83b_cds.fasta, Dmel_obp83b_cds, Dsim_obp83b_cds
new_names<-c("Dmel",
"Dsim",
"Dere",
"Dser",
"Dpse")
for (name in new_names){
EntryName("Final_ass", name)
}
system("mv Final_ass/obp83b_cds.fasta Final_ass/fa_cds.fasta")
## read the noncoding sequences
myseqs<-read.dna("Final_ass/fa_cds.fasta", format="fasta")
## multiple sequence alignment
myalign <- clustal(myseqs)
## write the alignment to file
write.dna(myalign, file="Final_ass/fa_aln.fasta", format="fasta")
a<-read.dna("Final_ass/fa_aln.fasta", format="fasta")
old_names<-get.fasta.name("Final_ass/fa_aln.fasta")
new_names<-c("Dmel",
"Dsim",
"Dere",
"Dser",
"Dpse")
ref2 <- data.frame(old_names, new_names)
rename.fasta(infile = "Final_ass/fa_aln.fasta", ref_table = ref2, outfile = "Final_ass/fa_renamed_aln.fasta")
Final_ass/fa_renamed_aln.fasta has been saved to /home/jj/Desktop/Bioinformatics/2nd_year/3term/Population_Genetics_ME/Seminars/P4_Estimating_nucleotide_divergences+evolutionary_rates
## read the noncoding alignment
cds <- read.alignment("Final_ass/fa_renamed_aln.fasta", format="fasta")
kaks(cds)$ks #synonymous
Dere Dmel Dpse Dser
Dmel 0.04469301
Dpse 0.50454520 0.51472975
Dser 0.52481511 0.52849324 0.60968113
Dsim 0.03733297 0.02182090 0.50377590 0.52545067
kaks(cds)$ka #Non-synomymous
Dere Dmel Dpse Dser
Dmel 0.00000000
Dpse 0.01145412 0.01143935
Dser 0.01139894 0.01236601 0.01435741
Dsim 0.00000000 0.00000000 0.01143935 0.01236601
library(phangorn)
cds <- read.alignment("Final_ass/fa_renamed_aln.fasta",format="fasta")
syntree <- NJ(kaks(cds)$ks)
write.tree(syntree, file="Final_ass/fa_Ks.tree")
plot(syntree)
tree<-read.tree(file="Final_ass/fa_Ks.tree")
rooted<-root(tree, "Dere") #We choose D.erecta as outgroup because it is not in any cluster
mrca_node <- getMRCA(rooted, c("Dser", "Dpse")) #I choose this two because they form a cluster
#I use the same distances that in the example
cal <- data.frame(
node = mrca_node,
age.min = 6.1,
age.max = 18.9
)
chr<-chronos(rooted, model="clock", calibration = cal)
Setting initial dates...
Fitting in progress... get a first set of estimates
(Penalised) log-lik = -2.295721
Optimising rates... dates... -2.295721
log-Lik = -2.295721
PHIIC = 12.59
plot(chr, main="calibrated tree (million years)")
axisPhylo()
tiplabels()
nodelabels()
chr$edge
[,1] [,2]
[1,] 6 7
[2,] 7 1
[3,] 7 2
[4,] 6 8
[5,] 8 3
[6,] 8 4
[7,] 6 5
chr$edge.length
[1] 10.838461 1.098351 1.098351 3.547680 8.389132 8.389132 11.936811
5. Which is the evolutionary rate of the RpL32 protein?
divergence_times <- c(dmel_dsim = 6.1, dmel_dere = 66, dmel_dser = 66, dmel_dpse = 66)
ks_values <- c(dmel_dsim = 0.021182090, dmel_dere = 0.04469301, dmel_dser = 0.01236601, dmel_dpse = 0.01143935)
evolutionary_rates <- ks_values /2*divergence_times
mean_evolutionary_rate <- mean(evolutionary_rates)
mean_evolutionary_rate
[1] 0.5812629
The evolutionary rate is 0.5812629
6. Are the evolutionary rates of the two proteins studied in this practice different?
Yes, the evolutionary rate of obp83b was 0.4265788 while this one is 0.5812629
7. Can you provide a molecular evolutionary explanation for the answer to question 6?
Because of difference of presence of variables like functional constraints, selective pressures, genetic redundancy, environmental interactions, and protein essentiality which can account for the more rapid evolutionary rate of one of the 2 proteins
8. Which of the other positions examined here (i.e. nonsynonymous, noncoding, the full cds…) could have been used to reconstruct the phylogenetic tree prior to calibration?
The CDS could have been used, particularly when combined with synonymous and noncoding regions to balance the evolutionary signals.
9. Looking at your results and the tree in Figure 2, do you think that the fact that some of the species are not the same in both analyses (D. elegans in obp83b and D. serrata in rpL32) influences the results? Discuss
The fact that they are different species probably does affect the analysis, between species there may be differences like observed evolutionary rates, divergence times, selective pressures, and lineage-specific evolutionary dynamics, which affect the analysis. We chould use the same set of species for both genes for a better analysis.
Submit this document with your
answers (or just a document with your answers to the questions in any
readable format, e.g. word, pdf, plain text…), and the R
code used to complete the final assignment to
AULAESCI
Deadline: June 28, 2024
Doubts? mailto:alejandro.sanchez@prof.esci.upf.edu